Automatic Synthesis of Data Cleansing Activities
Abstract
Data cleansing is growing in importance among both public and private organisations, mainly because of the large amount of data they rely on to support decision-making processes. This paper aims to show how model-based verification algorithms (namely, model checking) can contribute to addressing data cleansing issues. It also introduces a new benchmark problem focusing on labour market dynamics. The consistent evolution of the data is checked using a model defined on the basis of domain knowledge. We then formally introduce the concept of a universal cleanser, i.e. an object which summarises the set of all cleansing actions for each feasible data inconsistency (according to a given consistency model), and we provide an algorithm which synthesises it. The universal cleanser can be seen as a repository of corrective interventions useful for developing cleansing routines. We applied our approach to a dataset derived from Italian labour market data, and made the whole dataset and the outcomes publicly available to the community, so that the results we present can be shared and compared with other techniques.
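As a rough illustration of the idea of a universal cleanser, the sketch below builds one for a toy consistency model. It is not the paper's algorithm, which relies on model checking: here a plain breadth-first search stands in for it. The two states, the two events, and all function names are hypothetical and chosen only for this example.

```python
from collections import deque

# Hypothetical toy consistency model (an assumption, not taken from the paper):
# a worker is either "unemployed" or "employed", and only these moves are legal.
TRANSITIONS = {
    ("unemployed", "start"): "employed",
    ("employed", "end"): "unemployed",
}
STATES = {"unemployed", "employed"}
EVENTS = {"start", "end"}


def step(state, event):
    """Return the successor state, or None if the model rejects the event."""
    return TRANSITIONS.get((state, event))


def first_inconsistency(events, state="unemployed"):
    """Return (index, state, event) for the first rejected event, else None."""
    for i, event in enumerate(events):
        nxt = step(state, event)
        if nxt is None:
            return i, state, event
        state = nxt
    return None


def shortest_repairs(state, event, max_len=3):
    """Breadth-first search for the shortest event sequences that, inserted
    before `event`, lead to a state in which `event` is accepted."""
    queue = deque([(state, ())])
    repairs, found_len = [], None
    while queue:
        cur, path = queue.popleft()
        if found_len is not None and len(path) > found_len:
            break  # only keep repairs of minimal length
        if path and step(cur, event) is not None:
            repairs.append(path)
            found_len = len(path)
            continue
        if len(path) < max_len:
            for e in sorted(EVENTS):
                nxt = step(cur, e)
                if nxt is not None:
                    queue.append((nxt, path + (e,)))
    return repairs


def synthesise_cleanser():
    """Map every (state, rejected event) pair to its minimal corrective actions."""
    return {
        (s, e): shortest_repairs(s, e)
        for s in sorted(STATES)
        for e in sorted(EVENTS)
        if step(s, e) is None
    }


cleanser = synthesise_cleanser()
error = first_inconsistency(["start", "start", "end"])
```

In this toy setting the cleanser is computed once from the model alone, and a cleansing routine would then look up `cleanser[(state, event)]` whenever `first_inconsistency` reports a rejected event.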
Similar resources
Improving Data Cleansing Accuracy - A Model-based Approach
Research on data quality is growing in importance in both industrial and academic communities, as it aims at deriving knowledge (and then value) from data. Information Systems generate a lot of data useful for studying the dynamics of subjects’ behaviours or phenomena over time, making the quality of data a crucial aspect for guaranteeing the believability of the overall knowledge discovery pro...
Cleansing and preparation of data for statistical analysis: A step necessary in oral health sciences research
In many published articles, there is still no mention of quality control processes, which might be an indication of the insufficient importance the researchers attach to undertaking or reporting such processes. However, quality control of data is one of the most important steps in research projects. Lack of sufficient attention to quality control of data might have a detrimental effect on the r...
A Conceptual Framework for Data Cleansing – A Novel Approach to Support the Cleansing Process
Data errors occur in various ways when data is transferred from one point to the other. These data errors occur not necessarily from the formation/insertion of data but are developed and transformed when transferred from one process to another along the information chain within the data warehouse infrastructure. The main focus for this study is to conceptualize the data cleansing process from d...
Using Bayesian Networks for Cleansing Trauma Data
Medical data is unique due to its large volume, heterogeneity and complexity. This necessitates costly active participation of medical domain experts in the task of cleansing medical data. In this paper we present a new data cleansing approach that utilizes Bayesian networks to correct errant attribute values. Bayesian networks capture expert domain knowledge as well as the uncertainty inherent...
Optimizing Cost Function in Imperialist Competitive Algorithm for Path Coverage Problem in Software Testing
Search-based optimization methods have been used for software engineering activities such as software testing. In the field of software testing, search-based test data generation refers to application of meta-heuristic optimization methods to generate test data that cover the code space of a program. Automatic test data generation that can cover all the paths of software is known as a major cha...
Publication date: 2013